Template for test
In [1]:
from pred import Predictor
from pred import sequence_vector
from pred import chemical_vector
Controlling for random negatives vs. no random negatives across imbalance techniques, using S, T, and Y phosphorylation.
N phosphorylation is included; however, no benchmarks are available for it yet.
Training data is from Phospho.ELM and benchmarks are from dbPTM.
S Phosphorylation
In [2]:
par = ["pass", "ADASYN", "SMOTEENN", "random_under_sample", "ncl", "near_miss"]
for i in par:
    # y: no random negatives added (random_data=0)
    print("y", i)
    y = Predictor()
    y.load_data(file="Data/Training/clean_s.csv")
    y.process_data(vector_function="chemical", amino_acid="S", imbalance_function=i, random_data=0)
    y.supervised_training("forest")
    y.benchmark("Data/Benchmarks/phos.csv", "S")
    del y
    # x: same settings, with random negatives added (random_data=1)
    print("x", i)
    x = Predictor()
    x.load_data(file="Data/Training/clean_s.csv")
    x.process_data(vector_function="chemical", amino_acid="S", imbalance_function=i, random_data=1)
    x.supervised_training("forest")
    x.benchmark("Data/Benchmarks/phos.csv", "S")
    del x
y pass
Loading Data
Loaded Data
Working on Data
Finished working with Data
Training Data Points: 162268
Test Data Points: 18030
Starting Training
Done training
Test Results
Failed
TP 0 FP 0 TN 17726 FN 304
None
Number of data points in benchmark 65895
Benchmark Results
Failed
TP 0 FP 0 TN 62517 FN 3378
None
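In the `Failed` blocks above, TP and FP are both 0, so the model labelled every sample negative. A quick calculation from the logged benchmark counts shows why this is easy to miss if only accuracy is checked. This is a sketch over the printed numbers only; the helper name `all_negative_summary` is hypothetical and not part of `pred`.

```python
def all_negative_summary(tn, fn):
    """Summarise a classifier that predicts negative for every sample.

    tn: negatives in the data (all correctly rejected)
    fn: positives in the data (all missed)
    """
    total = tn + fn
    return {
        "positive_fraction": fn / total,  # class prevalence
        "accuracy": tn / total,           # looks high despite no hits
        "sensitivity": 0.0,               # TP is 0 by construction
    }

# Counts from the S benchmark of the "y pass" run: TN 62517, FN 3378.
summary = all_negative_summary(tn=62517, fn=3378)
print(summary)
```

Roughly 5% of the benchmark sites are positive, so an all-negative model scores about 95% accuracy while finding nothing.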
x pass
Loading Data
Loaded Data
Working on Data
Finished working with Data
Random Sequences Generated 180298
Filtering Random Data
Random Data Added: 180298
Finished with Random Data
Training Data Points: 342566
Test Data Points: 18030
Starting Training
Done training
Test Results
Failed
TP 0 FP 0 TN 17735 FN 295
None
Number of data points in benchmark 65895
Benchmark Results
Failed
TP 0 FP 0 TN 62517 FN 3378
None
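The `x` runs (`random_data=1`) log three steps that the `y` runs lack: generating random sequences, filtering them, and adding them to the training set. In the run above, 180298 sequences are generated (the same as 162268 + 18030, the full data size), training grows from 162268 to 342566 points, and the test set stays at 18030. The implementation inside `pred` is not shown in this notebook, so the following is only a minimal sketch of one way such random negatives could be built. It assumes fixed-length windows with the target residue at the centre and a filter that drops any window already present in the known data; the function name, window length, and uniform residue sampling are all assumptions.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def random_negatives(n, center, known, flank=6, seed=0):
    """Generate n unique random windows with `center` in the middle.

    Hypothetical helper: windows found in `known` are skipped so that a
    random negative never duplicates a real training sequence.
    """
    rng = random.Random(seed)
    out = set()
    while len(out) < n:
        left = "".join(rng.choice(AMINO_ACIDS) for _ in range(flank))
        right = "".join(rng.choice(AMINO_ACIDS) for _ in range(flank))
        window = left + center + right
        if window not in known:  # the "filtering" step
            out.add(window)
    return sorted(out)

known = {"AAAAAASAAAAAA"}  # toy stand-in for real training windows
negatives = random_negatives(5, "S", known)
print(negatives)
```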
y ADASYN
Loading Data
Loaded Data
Working on Data
Balancing Data
Balanced Data
Finished working with Data
Training Data Points: 319478
Test Data Points: 35498
Starting Training
Done training
Test Results
Sensitivity: 0.7520637951367439
Specificity : 0.49940647787010345
Accuracy: 0.6261479519972957
ROC 0.625735136503
TP 13392 FP 8856 TN 8835 FN 4415
None
Number of data points in benchmark 65895
Benchmark Results
Sensitivity: 0.7773830669034932
Specificity : 0.551705935985412
Accuracy: 0.5632749070490932
ROC 0.664544501444
TP 2626 FP 28026 TN 34491 FN 752
None
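The four printed scores can be recomputed from the confusion-matrix counts on the `TP ... FN` line. The sketch below uses the standard definitions. The `ROC` value in these logs matches the mean of sensitivity and specificity (which is what ROC AUC reduces to when it is computed from hard class labels), though that is inferred from the numbers rather than from the `pred` source. The function name is hypothetical.

```python
def metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, accuracy, balanced accuracy)."""
    sensitivity = tp / (tp + fn)            # true positive rate
    specificity = tn / (tn + fp)            # true negative rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    balanced = (sensitivity + specificity) / 2
    return sensitivity, specificity, accuracy, balanced

# Test-set counts from the "y ADASYN" run above.
sens, spec, acc, roc = metrics(tp=13392, fp=8856, tn=8835, fn=4415)
print(sens, spec, acc, roc)
```

Because the benchmark is heavily skewed toward negatives, accuracy there mostly tracks specificity, so sensitivity and the balanced score are the more informative columns when comparing runs.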
x ADASYN
Loading Data
Loaded Data
Working on Data
Balancing Data
Balanced Data
Finished working with Data
Random Sequences Generated 180298
Filtering Random Data
Random Data Added: 180298
Finished with Random Data
Training Data Points: 499776
Test Data Points: 35498
Starting Training
Done training
Test Results
Sensitivity: 0.4247230189528148
Specificity : 0.7353389400011289
Accuracy: 0.5797509718857401
ROC 0.580030979477
TP 7552 FP 4689 TN 13028 FN 10229
None
Number of data points in benchmark 65895
Benchmark Results
Sensitivity: 0.4265837773830669
Specificity : 0.7469648255674457
Accuracy: 0.7305410122164049
ROC 0.586774301475
TP 1441 FP 15819 TN 46698 FN 1937
None
y SMOTEENN
Loading Data
Loaded Data
Working on Data
Balancing Data
Balanced Data
Finished working with Data
Training Data Points: 290126
Test Data Points: 32237
Starting Training
Done training
Test Results
Sensitivity: 0.7783896162274541
Specificity : 0.6299861495844875
Accuracy: 0.7119148804169122
ROC 0.704187882906
TP 13853 FP 5343 TN 9097 FN 3944
None
Number of data points in benchmark 65895
Benchmark Results
Sensitivity: 0.6042036708111308
Specificity : 0.5537853703792568
Accuracy: 0.556369982547993
ROC 0.578994520595
TP 2041 FP 27896 TN 34621 FN 1337
None
x SMOTEENN
Loading Data
Loaded Data
Working on Data
Balancing Data
Balanced Data
Finished working with Data
Random Sequences Generated 180298
Filtering Random Data
Random Data Added: 180298
Finished with Random Data
Training Data Points: 470399
Test Data Points: 32234
Starting Training
Done training
Test Results
Sensitivity: 0.6578265765765766
Specificity : 0.7668923587121735
Accuracy: 0.7068002730036608
ROC 0.712359467644
TP 11683 FP 3374 TN 11100 FN 6077
None
Number of data points in benchmark 65895
Benchmark Results
Sensitivity: 0.5076968620485495
Specificity : 0.5862085512740535
Accuracy: 0.582183777221337
ROC 0.546952706661
TP 1715 FP 25869 TN 36648 FN 1663
None
y random_under_sample
Loading Data
Loaded Data
Working on Data
Balancing Data
Balanced Data
Finished working with Data
Training Data Points: 5691
Test Data Points: 633
Starting Training
Done training
Test Results
Sensitivity: 0.7169811320754716
Specificity : 0.5111111111111111
Accuracy: 0.6145339652448657
ROC 0.614046121593
TP 228 FP 154 TN 161 FN 90
None
Number of data points in benchmark 65895
Benchmark Results
Sensitivity: 0.7986974541148608
Specificity : 0.5775389094166387
Accuracy: 0.5888762425070188
ROC 0.688118181766
TP 2698 FP 26411 TN 36106 FN 680
None
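With `random_under_sample` the data shrinks from 180298 points (162268 + 18030 in the unbalanced run) to 6324 (5691 + 633), consistent with discarding majority-class rows until the classes are the same size. The sampler that `pred` calls is not shown here; the following is a minimal pure-Python sketch of the technique, with a hypothetical function name.

```python
import random
from collections import defaultdict

def random_under_sample(X, y, seed=0):
    """Randomly drop rows so every class has as many rows as the rarest one."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(y):
        by_class[label].append(idx)
    n_min = min(len(idxs) for idxs in by_class.values())
    keep = []
    for idxs in by_class.values():
        keep.extend(rng.sample(idxs, n_min))  # sample without replacement
    keep.sort()  # preserve the original row order
    return [X[i] for i in keep], [y[i] for i in keep]

# Toy data: 2 positives, 8 negatives.
X = [[i] for i in range(10)]
y = [1, 1] + [0] * 8
X_bal, y_bal = random_under_sample(X, y)
print(y_bal.count(0), y_bal.count(1))
```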
x random_under_sample
Loading Data
Loaded Data
Working on Data
Balancing Data
Balanced Data
Finished working with Data
Random Sequences Generated 180298
Filtering Random Data
Random Data Added: 180298
Finished with Random Data
Training Data Points: 185989
Test Data Points: 633
Starting Training
Done training
Test Results
Sensitivity: 0.21428571428571427
Specificity : 0.7723076923076924
Accuracy: 0.5007898894154819
ROC 0.493296703297
TP 66 FP 74 TN 251 FN 242
None
Number of data points in benchmark 65895
Benchmark Results
Sensitivity: 0.216696269982238
Specificity : 0.8225282723099318
Accuracy: 0.7914712800667729
ROC 0.519612271146
TP 732 FP 11095 TN 51422 FN 2646
None
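One reading of the sensitivity drop in this run: the log shows the random sequences being added after balancing, and the training set grows from 5691 to 185989 points (5691 + 180298). If the random sequences are all labelled negative, which the title suggests but this notebook does not show, the training set is no longer balanced. The sketch below estimates the resulting class ratio under that assumption, plus the assumption that the under-sampled training set is split evenly between classes.

```python
balanced_train = 5691   # training points in the "y random_under_sample" run
random_added = 180298   # "Random Data Added" in the x run

# Assumption: roughly half of the balanced set is positive.
positives = balanced_train // 2
total = balanced_train + random_added
positive_fraction = positives / total

print(total, round(positive_fraction, 4))
```

Under these assumptions positives make up under 2% of the training data, close to the imbalance that under-sampling was meant to remove.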
y ncl
Loading Data
Loaded Data
Working on Data
Balancing Data
Balanced Data
Finished working with Data
Training Data Points: 156883
Test Data Points: 17432
Starting Training
Done training
Test Results
Failed
TP 0 FP 0 TN 17116 FN 316
None
Number of data points in benchmark 65895
Benchmark Results
Failed
TP 0 FP 0 TN 62517 FN 3378
None
x ncl
Loading Data
Loaded Data
Working on Data
Balancing Data
Balanced Data
Finished working with Data
Random Sequences Generated 180298
Filtering Random Data
Random Data Added: 180298
Finished with Random Data
Training Data Points: 337181
Test Data Points: 17432
Starting Training
Done training
Test Results
Failed
TP 0 FP 0 TN 17108 FN 324
None
Number of data points in benchmark 65895
Benchmark Results
Failed
TP 0 FP 0 TN 62517 FN 3378
None
y near_miss
Loading Data
Loaded Data
Working on Data
Balancing Data
Balanced Data
Finished working with Data
Training Data Points: 5691
Test Data Points: 633
Starting Training
Done training
Test Results
Sensitivity: 0.5928338762214984
Specificity : 0.7331288343558282
Accuracy: 0.665086887835703
ROC 0.662981355289
TP 182 FP 87 TN 239 FN 125
None
Number of data points in benchmark 65895
Benchmark Results
Sensitivity: 0.6548253404381291
Specificity : 0.2865620551210071
Accuracy: 0.3054404734805372
ROC 0.47069369778
TP 2212 FP 44602 TN 17915 FN 1166
None
x near_miss
Loading Data
Loaded Data
Working on Data
Balancing Data
Balanced Data
Finished working with Data
Random Sequences Generated 180298
Filtering Random Data
Random Data Added: 180298
Finished with Random Data
Training Data Points: 185989
Test Data Points: 633
Starting Training
Done training
Test Results
Sensitivity: 0.19873817034700317
Specificity : 0.8924050632911392
Accuracy: 0.5450236966824644
ROC 0.545571616819
TP 63 FP 34 TN 282 FN 254
None
Number of data points in benchmark 65895
Benchmark Results
Sensitivity: 0.1965660153937241
Specificity : 0.8020058544076011
Accuracy: 0.7709689657788906
ROC 0.499285934901
TP 664 FP 12378 TN 50139 FN 2714
None
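Comparing runs by scrolling through output like the above is error-prone. A small parser can pull the run label and confusion-matrix counts out of the captured text. The sketch assumes the exact line shapes seen in this notebook (a `y <name>` or `x <name>` header, then `Test Results` and `Benchmark Results`, each followed later by a `TP ... FN ...` line); the function name is hypothetical.

```python
import re

RUN = re.compile(r"^([xy]) (\w+)$")
COUNTS = re.compile(r"^TP (\d+) FP (\d+) TN (\d+) FN (\d+)$")

def parse_log(text):
    """Return one dict per confusion-matrix line found in the log text."""
    rows, run, stage = [], None, None
    for line in text.splitlines():
        line = line.strip()
        m = RUN.match(line)
        if m:                      # start of a new run, e.g. "x ADASYN"
            run = m.groups()
            continue
        if line in ("Test Results", "Benchmark Results"):
            stage = line.split()[0].lower()
            continue
        m = COUNTS.match(line)
        if m and run and stage:
            tp, fp, tn, fn = map(int, m.groups())
            rows.append({"random": run[0] == "x", "method": run[1],
                         "stage": stage, "tp": tp, "fp": fp, "tn": tn, "fn": fn})
    return rows

# Excerpt of the "y near_miss" output, with the other lines omitted.
sample = """y near_miss
Test Results
TP 182 FP 87 TN 239 FN 125
Benchmark Results
TP 2212 FP 44602 TN 17915 FN 1166
"""
rows = parse_log(sample)
for row in rows:
    print(row)
```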
Y Phosphorylation
In [3]:
par = ["pass", "ADASYN", "SMOTEENN", "random_under_sample", "ncl", "near_miss"]
for i in par:
    # y: no random negatives added (random_data=0)
    print("y", i)
    y = Predictor()
    y.load_data(file="Data/Training/clean_Y.csv")
    y.process_data(vector_function="chemical", amino_acid="Y", imbalance_function=i, random_data=0)
    y.supervised_training("forest")
    y.benchmark("Data/Benchmarks/phos.csv", "Y")
    del y
    # x: same settings, with random negatives added (random_data=1)
    print("x", i)
    x = Predictor()
    x.load_data(file="Data/Training/clean_Y.csv")
    x.process_data(vector_function="chemical", amino_acid="Y", imbalance_function=i, random_data=1)
    x.supervised_training("forest")
    x.benchmark("Data/Benchmarks/phos.csv", "Y")
    del x
y pass
Loading Data
Loaded Data
Working on Data
Finished working with Data
Training Data Points: 11334
Test Data Points: 1260
Starting Training
Done training
Test Results
Failed
TP 0 FP 0 TN 1216 FN 44
None
Number of data points in benchmark 23555
Benchmark Results
Failed
TP 0 FP 0 TN 23514 FN 41
None
x pass
Loading Data
Loaded Data
Working on Data
Finished working with Data
Random Sequences Generated 12594
Filtering Random Data
Random Data Added: 12594
Finished with Random Data
Training Data Points: 23928
Test Data Points: 1260
Starting Training
Done training
Test Results
Failed
TP 0 FP 0 TN 1212 FN 48
None
Number of data points in benchmark 23555
Benchmark Results
Failed
TP 0 FP 0 TN 23514 FN 41
None
y ADASYN
Loading Data
Loaded Data
Working on Data
Balancing Data
Balanced Data
Finished working with Data
Training Data Points: 21744
Test Data Points: 2416
Starting Training
Done training
Test Results
Sensitivity: 0.6951915240423798
Specificity : 0.5458368376787216
Accuracy: 0.6216887417218543
ROC 0.620514180861
TP 853 FP 540 TN 649 FN 374
None
Number of data points in benchmark 23555
Benchmark Results
Sensitivity: 1.0
Specificity : 0.5932635876499107
Accuracy: 0.593971555932923
ROC 0.796631793825
TP 41 FP 9564 TN 13950 FN 0
None
x ADASYN
Loading Data
Loaded Data
Working on Data
Balancing Data
Balanced Data
Finished working with Data
Random Sequences Generated 12594
Filtering Random Data
Random Data Added: 12594
Finished with Random Data
Training Data Points: 34338
Test Data Points: 2416
Starting Training
Done training
Test Results
Sensitivity: 0.4297253634894992
Specificity : 0.7461799660441426
Accuracy: 0.5840231788079471
ROC 0.587952664767
TP 532 FP 299 TN 879 FN 706
None
Number of data points in benchmark 23555
Benchmark Results
Sensitivity: 1.0
Specificity : 0.7297780045930085
Accuracy: 0.730248354914031
ROC 0.864889002297
TP 41 FP 6354 TN 17160 FN 0
None
y SMOTEENN
Loading Data
Loaded Data
Working on Data
Balancing Data
Balanced Data
Finished working with Data
Training Data Points: 18132
Test Data Points: 2015
Starting Training
Done training
Test Results
Sensitivity: 0.8469055374592834
Specificity : 0.4498094027954257
Accuracy: 0.6918114143920595
ROC 0.648357470127
TP 1040 FP 433 TN 354 FN 188
None
Number of data points in benchmark 23555
Benchmark Results
Sensitivity: 1.0
Specificity : 0.4108190865016586
Accuracy: 0.41184461897686264
ROC 0.705409543251
TP 41 FP 13854 TN 9660 FN 0
None
x SMOTEENN
Loading Data
Loaded Data
Working on Data
Balancing Data
Balanced Data
Finished working with Data
Random Sequences Generated 12594
Filtering Random Data
Random Data Added: 12594
Finished with Random Data
Training Data Points: 30740
Test Data Points: 2017
Starting Training
Done training
Test Results
Sensitivity: 0.7210264900662252
Specificity : 0.6477132262051916
Accuracy: 0.6916212196331185
ROC 0.684369858136
TP 871 FP 285 TN 524 FN 337
None
Number of data points in benchmark 23555
Benchmark Results
Sensitivity: 1.0
Specificity : 0.6045759972782172
Accuracy: 0.6052642751008278
ROC 0.802287998639
TP 41 FP 9298 TN 14216 FN 0
None
y random_under_sample
Loading Data
Loaded Data
Working on Data
Balancing Data
Balanced Data
Finished working with Data
Training Data Points: 1009
Test Data Points: 113
Starting Training
Done training
Test Results
Sensitivity: 0.6862745098039216
Specificity : 0.6290322580645161
Accuracy: 0.6548672566371682
ROC 0.657653383934
TP 35 FP 23 TN 39 FN 16
None
Number of data points in benchmark 23555
Benchmark Results
Sensitivity: 1.0
Specificity : 0.6414051203538318
Accuracy: 0.6420292931437063
ROC 0.820702560177
TP 41 FP 8432 TN 15082 FN 0
None
x random_under_sample
Loading Data
Loaded Data
Working on Data
Balancing Data
Balanced Data
Finished working with Data
Random Sequences Generated 12594
Filtering Random Data
Random Data Added: 12594
Finished with Random Data
Training Data Points: 13603
Test Data Points: 113
Starting Training
Done training
Test Results
Sensitivity: 0.2807017543859649
Specificity : 0.9107142857142857
Accuracy: 0.5929203539823009
ROC 0.59570802005
TP 16 FP 5 TN 51 FN 41
None
Number of data points in benchmark 23555
Benchmark Results
Failed
TP 0 FP 3028 TN 20486 FN 41
None
y ncl
Loading Data
Loaded Data
Working on Data
Balancing Data
Balanced Data
Finished working with Data
Training Data Points: 10404
Test Data Points: 1157
Starting Training
Done training
Test Results
Failed
TP 0 FP 0 TN 1096 FN 61
None
Number of data points in benchmark 23555
Benchmark Results
Failed
TP 0 FP 0 TN 23514 FN 41
None
x ncl
Loading Data
Loaded Data
Working on Data
Balancing Data
Balanced Data
Finished working with Data
Random Sequences Generated 12594
Filtering Random Data
Random Data Added: 12594
Finished with Random Data
Training Data Points: 22998
Test Data Points: 1157
Starting Training
Done training
Test Results
Failed
TP 0 FP 0 TN 1105 FN 52
None
Number of data points in benchmark 23555
Benchmark Results
Failed
TP 0 FP 0 TN 23514 FN 41
None
y near_miss
Loading Data
Loaded Data
Working on Data
Balancing Data
Balanced Data
Finished working with Data
Training Data Points: 1009
Test Data Points: 113
Starting Training
Done training
Test Results
Sensitivity: 0.3695652173913043
Specificity : 0.835820895522388
Accuracy: 0.6460176991150443
ROC 0.602693056457
TP 17 FP 11 TN 56 FN 29
None
Number of data points in benchmark 23555
Benchmark Results
Sensitivity: 0.024390243902439025
Specificity : 0.5697882112783873
Accuracy: 0.5688388877096158
ROC 0.29708922759
TP 1 FP 10116 TN 13398 FN 40
None
x near_miss
Loading Data
Loaded Data
Working on Data
Balancing Data
Balanced Data
Finished working with Data
Random Sequences Generated 12594
Filtering Random Data
Random Data Added: 12594
Finished with Random Data
Training Data Points: 13603
Test Data Points: 113
Starting Training
Done training
Test Results
Sensitivity: 0.04
Specificity : 0.9682539682539683
Accuracy: 0.5575221238938053
ROC 0.504126984127
TP 2 FP 2 TN 61 FN 48
None
Number of data points in benchmark 23555
Benchmark Results
Failed
TP 0 FP 568 TN 22946 FN 41
None
T Phosphorylation
In [4]:
par = ["pass", "ADASYN", "SMOTEENN", "random_under_sample", "ncl", "near_miss"]
for i in par:
    # y: no random negatives added (random_data=0)
    print("y", i)
    y = Predictor()
    y.load_data(file="Data/Training/clean_t.csv")
    y.process_data(vector_function="chemical", amino_acid="T", imbalance_function=i, random_data=0)
    y.supervised_training("forest")
    y.benchmark("Data/Benchmarks/phos.csv", "T")
    del y
    # x: same settings, with random negatives added (random_data=1)
    print("x", i)
    x = Predictor()
    x.load_data(file="Data/Training/clean_t.csv")
    x.process_data(vector_function="chemical", amino_acid="T", imbalance_function=i, random_data=1)
    x.supervised_training("forest")
    x.benchmark("Data/Benchmarks/phos.csv", "T")
    del x
y pass
Loading Data
Loaded Data
Working on Data
Finished working with Data
Training Data Points: 58739
Test Data Points: 6527
Starting Training
Done training
Test Results
Failed
TP 0 FP 0 TN 6359 FN 168
None
Number of data points in benchmark 47730
Benchmark Results
Failed
TP 0 FP 0 TN 46497 FN 1233
None
x pass
Loading Data
Loaded Data
Working on Data
Finished working with Data
Random Sequences Generated 65266
Filtering Random Data
Random Data Added: 65266
Finished with Random Data
Training Data Points: 124005
Test Data Points: 6527
Starting Training
Done training
Test Results
Failed
TP 0 FP 0 TN 6398 FN 129
None
Number of data points in benchmark 47730
Benchmark Results
Failed
TP 0 FP 0 TN 46497 FN 1233
None
y ADASYN
Loading Data
Loaded Data
Working on Data
Balancing Data
Balanced Data
Finished working with Data
Training Data Points: 114707
Test Data Points: 12746
Starting Training
Done training
Test Results
Sensitivity: 0.7656447534766119
Specificity : 0.5244624493611717
Accuracy: 0.64420210262043
ROC 0.645053601419
TP 4845 FP 3052 TN 3366 FN 1483
None
Number of data points in benchmark 47730
Benchmark Results
Sensitivity: 0.7420924574209246
Specificity : 0.5947695550250554
Accuracy: 0.5985753195055521
ROC 0.668431006223
TP 915 FP 18842 TN 27655 FN 318
None
x ADASYN
Loading Data
Loaded Data
Working on Data
Balancing Data
Balanced Data
Finished working with Data
Random Sequences Generated 65266
Filtering Random Data
Random Data Added: 65266
Finished with Random Data
Training Data Points: 179973
Test Data Points: 12746
Starting Training
Done training
Test Results
Sensitivity: 0.3458575695645339
Specificity : 0.8111198120595144
Accuracy: 0.578926722108897
ROC 0.578488690812
TP 2200 FP 1206 TN 5179 FN 4161
None
Number of data points in benchmark 47730
Benchmark Results
Sensitivity: 0.45174371451743717
Specificity : 0.8338817558122029
Accuracy: 0.8240100565681961
ROC 0.642812735165
TP 557 FP 7724 TN 38773 FN 676
None
y SMOTEENN
Loading Data
Loaded Data
Working on Data
Balancing Data
Balanced Data
Finished working with Data
Training Data Points: 102158
Test Data Points: 11351
Starting Training
Done training
Test Results
Sensitivity: 0.8732240437158469
Specificity : 0.5080873433077234
Accuracy: 0.7141221037794027
ROC 0.690655693512
TP 5593 FP 2433 TN 2513 FN 812
None
Number of data points in benchmark 47730
Benchmark Results
Sensitivity: 0.8150851581508516
Specificity : 0.48852614147149276
Accuracy: 0.4969620783574272
ROC 0.651805649811
TP 1005 FP 23782 TN 22715 FN 228
None
x SMOTEENN
Loading Data
Loaded Data
Working on Data
Balancing Data
Balanced Data
Finished working with Data
Random Sequences Generated 65266
Filtering Random Data
Random Data Added: 65266
Finished with Random Data
Training Data Points: 167398
Test Data Points: 11349
Starting Training
Done training
Test Results
Sensitivity: 0.7189644416718652
Specificity : 0.7271622442779015
Accuracy: 0.7225306194378359
ROC 0.723063342975
TP 4610 FP 1347 TN 3590 FN 1802
None
Number of data points in benchmark 47730
Benchmark Results
Sensitivity: 0.7939983779399837
Specificity : 0.6146202980837473
Accuracy: 0.619254137858789
ROC 0.704309338012
TP 979 FP 17919 TN 28578 FN 254
None
y random_under_sample
Loading Data
Loaded Data
Working on Data
Balancing Data
Balanced Data
Finished working with Data
Training Data Points: 2710
Test Data Points: 302
Starting Training
Done training
Test Results
Sensitivity: 0.7315436241610739
Specificity : 0.5032679738562091
Accuracy: 0.6158940397350994
ROC 0.617405799009
TP 109 FP 76 TN 77 FN 40
None
Number of data points in benchmark 47730
Benchmark Results
Sensitivity: 0.8029197080291971
Specificity : 0.6043400649504269
Accuracy: 0.6094699350513304
ROC 0.70362988649
TP 990 FP 18397 TN 28100 FN 243
None
x random_under_sample
Loading Data
Loaded Data
Working on Data
Balancing Data
Balanced Data
Finished working with Data
Random Sequences Generated 65266
Filtering Random Data
Random Data Added: 65266
Finished with Random Data
Training Data Points: 67976
Test Data Points: 302
Starting Training
Done training
Test Results
Sensitivity: 0.23333333333333334
Specificity : 0.8355263157894737
Accuracy: 0.5364238410596026
ROC 0.534429824561
TP 35 FP 25 TN 127 FN 115
None
Number of data points in benchmark 47730
Benchmark Results
Sensitivity: 0.2441200324412003
Specificity : 0.85517345205067
Accuracy: 0.839388225434737
ROC 0.549646742246
TP 301 FP 6734 TN 39763 FN 932
None
y ncl
Loading Data
Loaded Data
Working on Data
Balancing Data
Balanced Data
Finished working with Data
Training Data Points: 56144
Test Data Points: 6239
Starting Training
Done training
Test Results
Failed
TP 0 FP 0 TN 6093 FN 146
None
Number of data points in benchmark 47730
Benchmark Results
Failed
TP 0 FP 0 TN 46497 FN 1233
None
x ncl
Loading Data
Loaded Data
Working on Data
Balancing Data
Balanced Data
Finished working with Data
Random Sequences Generated 65266
Filtering Random Data
Random Data Added: 65266
Finished with Random Data
Training Data Points: 121410
Test Data Points: 6239
Starting Training
Done training
Test Results
Failed
TP 0 FP 0 TN 6083 FN 156
None
Number of data points in benchmark 47730
Benchmark Results
Failed
TP 0 FP 0 TN 46497 FN 1233
None
y near_miss
Loading Data
Loaded Data
Working on Data
Balancing Data
Balanced Data
Finished working with Data
Training Data Points: 2710
Test Data Points: 302
Starting Training
Done training
Test Results
Sensitivity: 0.5174825174825175
Specificity : 0.8553459119496856
Accuracy: 0.695364238410596
ROC 0.686414214716
TP 74 FP 23 TN 136 FN 69
None
Number of data points in benchmark 47730
Benchmark Results
Sensitivity: 0.7064071370640713
Specificity : 0.3291395143772716
Accuracy: 0.3388853970249319
ROC 0.517773325721
TP 871 FP 31193 TN 15304 FN 362
None
x near_miss
Loading Data
Loaded Data
Working on Data
Balancing Data
Balanced Data
Finished working with Data
Random Sequences Generated 65266
Filtering Random Data
Random Data Added: 65266
Finished with Random Data
Training Data Points: 67976
Test Data Points: 302
Starting Training
Done training
Test Results
Sensitivity: 0.11764705882352941
Specificity : 0.9731543624161074
Accuracy: 0.5397350993377483
ROC 0.54540071062
TP 18 FP 4 TN 145 FN 135
None
Number of data points in benchmark 47730
Benchmark Results
Sensitivity: 0.13463098134630982
Specificity : 0.9718691528485709
Accuracy: 0.9502409386130316
ROC 0.553250067097
TP 166 FP 1308 TN 45189 FN 1067
None
Content source: vzg100/Post-Translational-Modification-Prediction